Abstract. Online learning constitutes a compelling mathematical framework
for analyzing sequential decision making problems in adversarial
environments. The learner repeatedly chooses an action, the environment
responds with an outcome, and the learner then receives a reward for the
played action. The goal of the learner is to maximize the total reward.
However, there are situations in which, in addition to maximizing the
cumulative reward, the sequence of decisions must satisfy some additional
constraints on average. In this paper we study an extension of online
learning in which the learner aims to maximize the total reward subject to
such constraints. Leveraging the Lagrangian method in constrained
optimization, we propose the Lagrangian exponentially weighted average
(LEWA) algorithm, a primal-dual variant of the well known exponentially
weighted average algorithm, to efficiently solve constrained online
decision making problems. Using a novel theoretical analysis, we establish
bounds on the regret and on the violation of the constraint in the full
information and bandit feedback models.



Keywords: online learning, bandit, regret-minimization, repeated game playing

1 Introduction

Many practical problems such as online portfolio management [1], prediction 
from expert advice [2,3], and online shortest path problem [4], involve making 
repeated decisions in an unknown and unpredictable environment (see, e.g. [5] 
for a comprehensive review). These situations can be formulated as a repeated 
game between the decision maker (i.e., the learner) and the adversary (i.e., the 
environment). At each round of the game, the learner selects an action from a 
fixed set of actions and then receives feedback (i.e., reward) for the selected ac- 
tion. In the adversarial or non-stochastic feedback model, we make no statistical 
assumption on the sequence of rewards except that the rewards are bounded. 
The player would like to learn from the past and hopefully make better decisions 
as time goes by, so that the total accumulated reward is large. 

The analysis of online learning algorithms focuses on establishing bounds on
the regret, that is, the difference between the reward of the best fixed action
chosen with hindsight knowledge of the observed sequence and the cumulative
reward of the online learner. If an online algorithm attains a sublinear bound
on the regret, it is said to be Hannan consistent [5], which indicates that in
the long run the learner's average reward per round approaches the average
reward per round of the best action. Note that the performance bound must hold
for any sequence of rewards, in particular when the sequence is chosen
adversarially. We also note that this setting differs from the framework of
competitive analysis, where the decision maker is allowed to first observe the
reward vector and then make the decision and receive the reward accordingly [6].

In much of the current literature, the application of online learning is limited
to problems without constraints on the decisions. However, in most scenarios,
beyond maximizing the cumulative reward, there are restrictions on the sequence
of decisions made by the learner that need to be satisfied on average. Moreover,
in some applications it is beneficial to sacrifice some reward in order to meet
other goals simultaneously. Therefore, one might desire algorithms for a more
ambitious framework, in which the total reward is maximized under constraints
defined on the sequence of decisions. An attempt at such an extension was made
in [7], where online learning with path constraints was addressed and algorithms
with asymptotically vanishing bounds were proposed.

As an illustrative example, consider a wireless communication system in which
the agent chooses an appropriate transmission power in order to transmit a
message successfully. If one regards the amount of power required to transmit a
packet through a path as its cost, the goal of the agent may be to maximize the
average throughput while keeping the average power consumption under some
required threshold. As another motivating example, consider online ad placement
with budgeted advertisers. This problem can be cast as a multi-armed bandit
(MAB) problem, with the set of arms being the set of ads. Since each advertiser
has a limited budget for presenting his ads, the online learner must take the
budget restriction of each advertiser into account when making decisions.

To model the situations mentioned above, we modify the online learning problem
to achieve both goals simultaneously; the additional goal is called the
constraint throughout the paper to distinguish it from the regret. Roughly
speaking, we try to devise online algorithms that maximize the revenue and at
the same time guarantee a vanishing bound on the violation of the additional
constraint. The constraint defined over the actions necessitates a compromise:
if the algorithm is too aggressive in satisfying the constraint, there is less
hope of attaining a satisfactory cumulative reward at the end of the game; on
the other hand, merely trying to maximize the cumulative reward ends up in a
situation in which the violation of the constraint grows linearly in the number
of rounds.

An algorithm addressing this problem has to balance between maximizing the
adversarial rewards and satisfying the constraint. To address the problem, we
provide a general framework for repeated games with constraints and propose a
simple randomized algorithm, called the Lagrangian exponentially weighted
average (LEWA) algorithm, for a particular class of these games. The proposed
formulation is inspired by the Lagrangian method in constrained optimization and
is based on a primal-dual formulation of the exponentially weighted average
(EWA) algorithm [3] [8]. To the best of our knowledge, this is the first time a
Lagrangian style relaxation has been proposed for this type of problem.

The contributions of the present work are to 1) introduce a general primal-dual
framework for solving the online learning with constraints problem; 2) propose a
Lagrangian based exponentially weighted average algorithm for solving repeated
games with constraints; 3) establish expected and high probability bounds on
the regret and on the average violation of the constraints; 4) extend the
results to the bandit setting, where only partial feedback about the rewards and
constraints is available.

Notations. Before proceeding, we define the notation used in this paper.
Vectors are indicated by lower case bold letters such as $\mathbf{x}$, where
$\mathbf{x}^\top$ denotes the transpose. By default, all vectors are column
vectors. For a vector $\mathbf{x}$, $x_i$ denotes its $i$th coordinate. We use
superscripts to index rounds of the game. Component-wise multiplication between
vectors is denoted by $\circ$. We use $[K]$ as a shorthand for the set of
integers $\{1, 2, \ldots, K\}$. Throughout the paper we denote by $[\cdot]_+$
the projection onto the positive orthant. We shall use $\mathbf{1}$ to denote
the vector of all ones. Finally, for a $K$-dimensional vector $\mathbf{x}$,
$(\mathbf{x})^2$ represents $(x_1^2, \ldots, x_K^2)$.

2 Statement of the Problem 

We consider the general decision-theoretic framework for online learning and
extend it to capture the constraint. In the original online decision making
problem, the learner is given access to a pool of $K$ actions. In each round
$t \in [T]$, the learner chooses a probability distribution
$\mathbf{p}_t = (p_1^t, \ldots, p_K^t)$ over the actions $[K]$ and chooses an
action $i$ randomly based on $\mathbf{p}_t$. In the full information scenario,
at each iteration the adversary reveals a reward vector
$\mathbf{r}_t = (r_1^t, \ldots, r_K^t)$. Choosing an action $i$ results in
receiving a reward $r_i^t$, which we shall assume without loss of generality to
be bounded in $[0,1]$. In the partial information or bandit setting, only the
reward of the selected action is revealed by the adversary. The learner competes
with the best fixed action in hindsight and his/her goal is to minimize the
regret defined as

$$\mathrm{Regret}_T = \max_{i \in [K]} \sum_{t=1}^T r_i^t - \sum_{t=1}^T \mathbf{p}_t^\top \mathbf{r}_t .$$

This problem is well studied, and there are algorithms which attain an optimal
regret bound of $O(\sqrt{T \ln K})$ after $T$ rounds of the game. In this paper
we focus on the exponentially weighted average (EWA) algorithm, which will be
used later as the baseline of the proposed algorithm. The EWA algorithm
maintains a weight vector $\mathbf{w}_t = (w_1^t, \ldots, w_K^t)$ which is used
to define the probabilities over actions. After receiving the reward vector
$\mathbf{r}_t$ at round $t$, the EWA algorithm updates the weight vector
according to $w_i^{t+1} = w_i^t \exp(\eta r_i^t)$, where $\eta$ is the learning
rate.
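The EWA update described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a reference implementation: the function names (`ewa_probabilities`, `ewa_update`, `run_ewa`) and the seeding of the random generator are choices made here for the example, and the weights are initialized to one as in the algorithms presented later.

```python
import math
import random


def ewa_probabilities(weights):
    """Normalize the weight vector w_t into the distribution p_t."""
    total = sum(weights)
    return [w / total for w in weights]


def ewa_update(weights, rewards, eta):
    """One EWA step: w_i^{t+1} = w_i^t * exp(eta * r_i^t)."""
    return [w * math.exp(eta * r) for w, r in zip(weights, rewards)]


def run_ewa(reward_rounds, eta, seed=0):
    """Play EWA against a fixed list of reward vectors (full information).

    Returns the reward collected by the sampled actions and the final
    distribution over actions.
    """
    rng = random.Random(seed)  # hypothetical seeding, for reproducibility only
    num_actions = len(reward_rounds[0])
    weights = [1.0] * num_actions
    collected = 0.0
    for rewards in reward_rounds:
        p_t = ewa_probabilities(weights)
        action = rng.choices(range(num_actions), weights=p_t)[0]
        collected += rewards[action]
        weights = ewa_update(weights, rewards, eta)
    return collected, ewa_probabilities(weights)
```

When one action is consistently better, the distribution concentrates on it at a rate governed by $\eta$, which is the behavior the regret bound quantifies.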



In the new setting addressed in this paper, which we refer to as constrained
regret minimization, in addition to the rewards, there exist some constraints on
the decisions that need to be satisfied. In particular, for the decision
$\mathbf{p}$ made by the learner, there is an additional constraint
$\mathbf{p}^\top \mathbf{c} \geq c_0$, where $\mathbf{c}$ is a constraint vector
specifying the constraint (e.g., the cost vector for the arms in the MAB
problem). We note that, in general, the reward vector $\mathbf{r}_t$ and the
constraint vector $\mathbf{c}$ are different and cannot be combined into a
single objective. The learner's goal is to maximize the total reward with
respect to the optimal decision in hindsight under the constraint
$\mathbf{p}^\top \mathbf{c} \geq c_0$, i.e.,

$$\min_{\mathbf{p}_1, \ldots, \mathbf{p}_T} \; \max_{\mathbf{p}^\top \mathbf{c} \geq c_0} \; \sum_{t=1}^T \mathbf{p}^\top \mathbf{r}_t - \sum_{t=1}^T \mathbf{p}_t^\top \mathbf{r}_t ,$$

and simultaneously satisfy the constraint. Note that the comparator class
consists of fixed decisions $\mathbf{p}$ that satisfy the additional constraint;
the best of them attains the maximal cumulative reward the learner could have
obtained had he known the rewards beforehand.

Within our setting, we consider repeated games with adversarial rewards and
stochastic constraints. More precisely, let $\mathbf{c} = (c_1, \ldots, c_K)$ be
the constraint vector defined over the actions. In the stochastic setting the
vector $\mathbf{c}$ is unknown to the learner, and in each round $t \in [T]$,
beyond the reward feedback, the learner receives a random realization
$\mathbf{c}_t = (c_1^t, \ldots, c_K^t)$ of $\mathbf{c}$, where
$\mathbb{E}[c_i^t] = c_i$. The learner's goal is to choose a sequence of
decisions $\mathbf{p}_t$, $t \in [T]$, to minimize the regret with respect to
the optimal decision in hindsight under the constraint
$\mathbf{p}^\top \mathbf{c} \geq c_0$. Without loss of generality we assume
$\mathbf{c}_t \in [0,1]^K$ and $c_0 \in [0,1]$. Formally, the goal of the
learner is to attain a gradually vanishing constrained regret as

$$\mathrm{Regret}_T = \max_{\mathbf{p}^\top \mathbf{c} \geq c_0} \sum_{t=1}^T \mathbf{p}^\top \mathbf{r}_t - \sum_{t=1}^T \mathbf{p}_t^\top \mathbf{r}_t \leq O(T^{1-\beta_1}). \tag{1}$$

Furthermore, the decisions $\mathbf{p}_t$, $t = 1, \ldots, T$, made by the
learner are required to attain a sub-linear bound on the violation of the
constraint in the long run, i.e.,

$$\mathrm{Violation}_T = \Big[ \sum_{t=1}^T (c_0 - \mathbf{p}_t^\top \mathbf{c}) \Big]_+ \leq O(T^{1-\beta_2}). \tag{2}$$

We refer to the above quantity as the violation of the constraint. We
distinguish two different types of constraint satisfaction: one shot and long
term satisfaction. In one shot constraint satisfaction, the learner is required
to satisfy the constraint at each round, i.e.,
$\mathbf{p}_t^\top \mathbf{c} \geq c_0$. In contrast, in the long term version,
the learner is allowed to violate the constraint for some rounds in a controlled
way, but the constraint must hold on average over all rounds, i.e.,
$\big(\sum_{t=1}^T \mathbf{p}_t^\top \mathbf{c}\big)/T \geq c_0$.
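To make the two performance measures concrete, the following sketch evaluates the constrained regret in (1) and the violation in (2) for a recorded sequence of decisions. It is an illustration under stated assumptions: the helper names are hypothetical, the true constraint vector is assumed to be known to the evaluator, and the best constrained comparator is found by enumerating the vertices of the feasible set (pure feasible actions, plus mixtures of one feasible and one infeasible action that meet the constraint with equality), which suffices because the objective is linear.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def best_constrained_reward(cum_rewards, c, c0):
    """Maximize p^T R over the simplex subject to p^T c >= c0.

    Returns None if no distribution satisfies the constraint.
    """
    num_actions = len(cum_rewards)
    candidates = [cum_rewards[i] for i in range(num_actions) if c[i] >= c0]
    for i in range(num_actions):
        for j in range(num_actions):
            if c[i] > c0 > c[j]:
                # Mixture of i and j that meets the constraint with equality.
                a = (c0 - c[j]) / (c[i] - c[j])
                candidates.append(a * cum_rewards[i] + (1 - a) * cum_rewards[j])
    return max(candidates) if candidates else None


def regret_and_violation(decisions, rewards, c, c0):
    """Evaluate Regret_T from (1) and Violation_T from (2)."""
    num_actions = len(c)
    cum_rewards = [sum(r[i] for r in rewards) for i in range(num_actions)]
    learner_reward = sum(dot(p, r) for p, r in zip(decisions, rewards))
    best = best_constrained_reward(cum_rewards, c, c0)
    regret = best - learner_reward
    violation = max(0.0, sum(c0 - dot(p, c) for p in decisions))
    return regret, violation
```

A learner that ignores the constraint can have negative regret against the constrained comparator while its violation grows linearly, which is exactly the compromise discussed in the introduction.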

The main questions addressed in this paper are how to modify the EWA algorithm
to take the constraints into consideration and what bounds on the regret and on
the violation of the constraints are attainable by the modified algorithm.

3 Related Works



As is well known, a wide range of literature deals with the online decision
making problem without constraints, and there exist a number of
regret-minimizing algorithms that attain the optimal regret bound. The most
well-known and successful work is probably the Hedge algorithm [8], which is a
direct generalization of Littlestone and Warmuth's Weighted Majority (WM)
algorithm [3]. Other recent studies include improved theoretical bounds and the
parameter-free hedging algorithm [9] and adaptive Hedge [10] for
decision-theoretic online learning. We refer readers to [5] for an in-depth
discussion of this subject.

In a seminal paper on the adversarial setting, Mannor et al. [7] introduced
online learning with simple path constraints. They considered infinitely
repeated two-player games with stochastic rewards where, for every joint action
of the players, there is an additional stochastic constraint vector that is
accumulated by the decision maker. The learner is asked to keep the cumulative
constraint vector in a predefined convex set in the space of constraint vectors.
They showed that if the convex set is affected by both decisions and rewards,
the optimal reward is generally unattainable online. The positive result is that
a relaxed goal, defined in terms of the convex hull of the constrained reward in
hindsight, is attainable. For the relaxed setting, they suggested two
inefficient algorithms: one relies on Blackwell's approachability theory and the
other is based on calibrated forecasts of the adversary's actions. Given the
implementation difficulties associated with these two methods, they suggested
two efficient heuristic methods to attain the reward while meeting the
constraint in the long run. We note that the analysis in [7] is asymptotic,
while the bounds to be established in this work are applicable to finite
repeated games.

In [11] the budget limited MAB was introduced, where pulling an arm is costly
and the cost of each arm is fixed in advance. In this setting both the
exploration and exploitation phases are limited by a global budget. This setting
corresponds to the game with stochastic rewards and deterministic constraints in
which no violation is allowed. It has been shown that existing MAB algorithms
are not suitable for dealing efficiently with costly arms. The authors proposed
the $\epsilon$-first algorithm, which dedicates the first $\epsilon$ fraction of
the total budget exclusively to exploration and the remaining $(1 - \epsilon)$
fraction to exploitation. [12] improves the bound obtained in [11] by proposing
a knapsack based UCB algorithm, which extends the UCB algorithm by solving a
knapsack problem at each round to cope with the constraints. We note that
knapsack based UCB does not make an explicit distinction between exploration and
exploitation steps, as is done in the $\epsilon$-first algorithm. In both [11]
and [12] the algorithm proceeds as long as sufficient budget remains to play the
arms.

Finally, we remark that our setting differs from the setting considered in [13],
which puts restrictions on the actions taken by the adversary and not on those
of the learner, as in our case.

In this section, we present the basic algorithm for the online learning with
constraints problem and analyze its performance via the primal-dual method in
the adversarial setting.

A straightforward approach to tackle the problem is to modify the reward
function of the learner to include a constraint term with a penalty coefficient
that adjusts the probabilities of the actions when the constraint is violated.
This approach circumvents the problem of constrained online learning by turning
it into an unconstrained problem. But a simple analysis shows that, in the
adversarial setting, this simple penalty based approach fails to attain
gradually vanishing bounds for both the regret and the violation of the
constraint. The main difficulty arises from the fact that an adaptive adversary
can play against the penalty coefficient associated with the constraint in order
to weaken its influence, which results in a linear bound on at least one of the
two measures, i.e., either the regret or the violation of the constraint.

Alternatively, since the constraint vector in our setting is stochastic, one
possible solution is to use an exploration and exploitation scheme, i.e., to
burn a small portion $\epsilon$ of the rounds to estimate the constraint vector
$\mathbf{c}$ by $\hat{\mathbf{c}}$ and then, in the remaining $(1 - \epsilon)T$
rounds, follow the existing algorithms with restricted decisions, i.e.,
$\mathbf{p} \in \Delta_K \cap \{\mathbf{p} : \mathbf{p}^\top \hat{\mathbf{c}} \geq c_0\}$,
where $\Delta_K$ is the simplex over $[K]$. The parameter $\epsilon$ balances
the accuracy of estimating $\mathbf{c}$ against the number of rounds left for
exploitation to increase the total reward. One may hope that by careful
adjustment of $\epsilon$ it would be possible to get satisfactory bounds on the
regret and the violation of the constraint. Unfortunately, this naive approach
suffers from two main drawbacks. First, the number of rounds $T$ is not known in
advance. Second, the decisions are made by projecting onto the estimated domain
$\mathbf{p}^\top \hat{\mathbf{c}} \geq c_0$ instead of the true domain
$\mathbf{p}^\top \mathbf{c} \geq c_0$, which is problematic for the following
reason. In order to show the regret bound, we need to relate the best cumulative
reward in the estimated domain to that in the true domain, which requires
imposing a regularity condition on the reward and constraint vectors [14]. We
can make the algorithm adaptive to $T$ by using an idea similar to the epoch
greedy algorithm [15], which runs exploration/exploitation in epochs, but it
still suffers from the second drawback. Additionally, projecting onto the domain
defined by the inaccurate estimate $\hat{\mathbf{c}}$ does not exclude the
possibility that the solution will be infeasible.

Here we take a different path to solve the problem. The proposed algorithm is
inspired by the Lagrangian method in constrained optimization. The intuition
behind the proposed algorithm is to optimize one criterion (i.e., minimizing the
regret or maximizing the reward) subject to an explicit constraint on the
restrictions that the learner needs to satisfy on average over the sequence of
decisions. A challenging ingredient in this formulation is establishing bounds
on the regret and the violation of the constraint.

LEWA ($\eta$ and $\delta$)

initialize: $\mathbf{w}_1 = \mathbf{1}$ and $\lambda_1 = 0$
iterate t = 1, 2, ..., T
    Draw an action according to the probability $\mathbf{p}_t = \mathbf{w}_t / \sum_j w_j^t$
    Receive reward $\mathbf{r}_t$ and a realization of constraint $\mathbf{c}_t$
    Update $\mathbf{w}_{t+1} = \mathbf{w}_t \circ \exp(\eta(\mathbf{r}_t + \lambda_t \mathbf{c}_t))$
    Update $\lambda_{t+1} = [(1 - \delta\eta)\lambda_t - \eta(\mathbf{p}_t^\top \mathbf{c}_t - c_0)]_+$
end iterate

Fig. 1. Lagrangian exponentially weighted average for full information online
decision making under constraints

In particular, our algorithms will exhibit a bound of the following structure,

$$\mathrm{Regret}_T + \frac{\mathrm{Violation}_T^2}{O(T^{1-\alpha})} \leq O(T^{1-\beta}), \tag{3}$$

where $\mathrm{Violation}_T$ is a term related to the violation of the
constraint in the long term. From (3) we can derive bounds on the regret and the
violation of the constraint as

$$\mathrm{Regret}_T \leq O(T^{1-\beta}) \tag{4}$$

$$\mathrm{Violation}_T \leq O\Big(\sqrt{[T + T^{1-\beta}]\, T^{1-\alpha}}\Big), \tag{5}$$

where the last bound follows from the fact that $-\mathrm{Regret}_T \leq O(T)$.
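The passage from (3) to (4) and (5) can be written out explicitly. In the sketch below the constants $C_1$, $C_2$, $C_3$ are placeholders introduced here to stand for the constants hidden in the $O(\cdot)$ terms; they do not appear in the original statement.

```latex
\begin{align*}
\mathrm{Regret}_T
  &\le C_2 T^{1-\beta} - \frac{\mathrm{Violation}_T^2}{C_1 T^{1-\alpha}}
   \le C_2 T^{1-\beta}
  && \text{(drop the nonnegative term in (3))} \\
\mathrm{Violation}_T^2
  &\le C_1 T^{1-\alpha}\bigl(C_2 T^{1-\beta} - \mathrm{Regret}_T\bigr)
  && \text{(rearrange (3))} \\
  &\le C_1 T^{1-\alpha}\bigl(C_2 T^{1-\beta} + C_3 T\bigr)
  && \text{(use } -\mathrm{Regret}_T \le C_3 T\text{)} \\
\mathrm{Violation}_T
  &\le \sqrt{C_1 T^{1-\alpha}\bigl(C_3 T + C_2 T^{1-\beta}\bigr)}
   = O\Bigl(\sqrt{[T + T^{1-\beta}]\,T^{1-\alpha}}\Bigr).
\end{align*}
```

For instance, with $\alpha = \beta = 1/2$ the regret is $O(T^{1/2})$ and the violation is $O(\sqrt{T \cdot T^{1/2}}) = O(T^{3/4})$.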

The detailed steps of the proposed algorithm are shown in Fig. 1. The algorithm
keeps two sets of variables: the weight vector $\mathbf{w}_t$ and the Lagrangian
multiplier $\lambda_t$. The high level interpretation of the algorithm is as
follows: if the constraint is being violated a lot, the decision maker places
more weight, controlled by $\lambda_t$, on the constraint; it reduces the weight
on the constraint when the constraint is satisfied reasonably well. We note that
LEWA is equivalent to the original EWA when the constraint is satisfied at each
iteration, i.e., $\mathbf{p}_t^\top \mathbf{c}_t \geq c_0$, which gives
$\lambda_1 = \cdots = \lambda_t = \cdots = 0$. It should be emphasized that in
some previous works such as [11], the learner is not allowed to exceed a
pre-specified threshold for the violation of the constraint, and the game stops
as soon as the learner violates the constraint. In contrast, within our setting,
the learner's goal is to obtain a sub-linear bound on the long term violation of
the constraint.
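The primal and dual updates of LEWA can be sketched as follows. This is an illustrative rendering under stated assumptions, not the authors' code: the function name `lewa`, the seeded random generator, and the use of log-weights (a numerical-stability choice that leaves $\mathbf{p}_t$ unchanged) are introduced here. Both updates use the multiplier $\lambda_t$ and the distribution $\mathbf{p}_t$ from the current round.

```python
import math
import random


def lewa(reward_rounds, constraint_rounds, c0, eta, delta, seed=0):
    """Full-information LEWA sketch.

    Returns the decisions p_t, the multipliers lambda_t and the sampled actions.
    """
    rng = random.Random(seed)  # hypothetical seeding, for reproducibility only
    num_actions = len(reward_rounds[0])
    log_w = [0.0] * num_actions  # log of w_1 = 1 (assumption: log-space weights)
    lam = 0.0                    # lambda_1 = 0
    decisions, multipliers, actions = [], [], []
    for r_t, c_t in zip(reward_rounds, constraint_rounds):
        shift = max(log_w)
        w = [math.exp(v - shift) for v in log_w]
        total = sum(w)
        p_t = [v / total for v in w]
        actions.append(rng.choices(range(num_actions), weights=p_t)[0])
        decisions.append(p_t)
        multipliers.append(lam)
        # Primal step: w_{t+1} = w_t o exp(eta * (r_t + lambda_t * c_t)).
        log_w = [v + eta * (r + lam * c) for v, r, c in zip(log_w, r_t, c_t)]
        # Dual step: lambda_{t+1} = [(1 - delta*eta)*lambda_t - eta*(p_t^T c_t - c0)]_+.
        constraint_value = sum(p * c for p, c in zip(p_t, c_t))
        lam = max(0.0, (1.0 - delta * eta) * lam - eta * (constraint_value - c0))
    return decisions, multipliers, actions
```

When every realization satisfies $\mathbf{p}_t^\top \mathbf{c}_t \geq c_0$, the multiplier stays at zero and the sketch reduces to plain EWA, matching the remark above.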

We now state the main theorem about the performance of LEWA algorithm. 



Theorem 1. Let $\mathbf{p}_1, \mathbf{p}_2, \ldots, \mathbf{p}_T$ be the
sequence of randomized decisions over the set of actions
$[K] := \{1, 2, \ldots, K\}$ produced by the LEWA algorithm under the sequence
of adversarial rewards
$\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_T \in [0,1]^K$ observed for
these decisions. Let $\lambda_1, \lambda_2, \ldots, \lambda_T$ be the
corresponding dual sequence. By setting $\eta = \sqrt{4 \ln K / (9T)}$ and
$\delta = \eta/2$ we have:

$$\max_{\mathbf{p}^\top \mathbf{c} \geq c_0} \sum_{t=1}^T \mathbf{p}^\top \mathbf{r}_t - \mathbb{E}\Big[\sum_{t=1}^T \mathbf{p}_t^\top \mathbf{r}_t\Big] \leq 3\sqrt{T \ln K} \quad \text{and} \quad \mathbb{E}\Big[\Big[\sum_{t=1}^T (c_0 - \mathbf{p}_t^\top \mathbf{c})\Big]_+\Big] \leq O(T^{3/4}),$$

where the expectation is taken over the randomness in
$\mathbf{c}_1, \ldots, \mathbf{c}_T$.

From Theorem 1 we see that the LEWA algorithm attains the optimal bound for the
regret and an $O(T^{3/4})$ bound on the violation of the constraint. Before
proving Theorem 1, we state two lemmas that pave the way to the proof of the
theorem.



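Lemma 1. [Primal Inequality] Let
$\mathbf{R}_t = \mathbf{R}_t^1 + \lambda_t \mathbf{R}_t^2$, where
$\mathbf{R}_t^1, \mathbf{R}_t^2 \in \mathbb{R}_+^K$,
$\mathbf{w}_{t+1} = \mathbf{w}_t \circ \exp(\eta \mathbf{R}_t)$, and
$\mathbf{p}_t = \mathbf{w}_t / \mathbf{w}_t^\top \mathbf{1}$. Assuming
$\max(\|\mathbf{R}_t^1\|_\infty, \|\mathbf{R}_t^2\|_\infty) \leq s$, we have the
following primal inequality

$$\sum_{t=1}^T (\mathbf{p} - \mathbf{p}_t)^\top \mathbf{R}_t \leq \frac{\ln K}{\eta} + \frac{\eta s^2}{4}\Big(T + \sum_{t=1}^T \lambda_t^2\Big). \tag{6}$$

Proof. Let $W_t = \sum_{i=1}^K w_i^t$. We first show an upper bound and a lower
bound on $\ln(W_{T+1}/W_1)$, followed by combining the bounds together. We have

$$\sum_{t=1}^T \ln\frac{W_{t+1}}{W_t} = \ln\frac{W_{T+1}}{W_1} = \ln\sum_{i=1}^K w_i^{T+1} - \ln K \geq \ln\sum_{i=1}^K p_i w_i^{T+1} - \ln K \geq \eta\, \mathbf{p}^\top \sum_{t=1}^T \mathbf{R}_t - \ln K,$$

where the last inequality follows from the concavity of the log function. By
following Lemma 2.2 in [5], we obtain

$$\sum_{t=1}^T \ln\frac{W_{t+1}}{W_t} = \sum_{t=1}^T \ln \sum_{i=1}^K \frac{w_i^t \exp(\eta R_i^t)}{W_t} \leq \sum_{t=1}^T \Big( \eta \sum_{j=1}^K \frac{w_j^t}{W_t} R_j^t + \frac{\eta^2 s^2 (1 + \lambda_t)^2}{8} \Big) = \eta \sum_{t=1}^T \mathbf{p}_t^\top \mathbf{R}_t + \frac{\eta^2 s^2}{8} \sum_{t=1}^T (1 + \lambda_t)^2 .$$

Combining the lower and upper bounds and using the inequality
$(a + b)^2 \leq 2(a^2 + b^2)$, we obtain the desired inequality in (6).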

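Lemma 2. [Dual Inequality] Let
$g_t(\lambda) = \frac{\delta}{2}\lambda^2 + \lambda(\beta_t - c_0)$,
$\lambda_{t+1} = [\lambda_t - \eta \nabla g_t(\lambda_t)]_+$, and
$\lambda_1 = 0$. Assuming $\eta > 0$ and $0 \leq \beta_t \leq \beta_0$, we have

$$\sum_{t=1}^T (\lambda_t - \lambda)(\beta_t - c_0) + \frac{\delta}{2} \sum_{t=1}^T (\lambda_t^2 - \lambda^2) \leq \frac{\lambda^2}{2\eta} + (c_0^2 + \beta_0^2)\,\eta T. \tag{7}$$

Proof. First we note that

$$\lambda_{t+1} = [\lambda_t - \eta \nabla g_t(\lambda_t)]_+ = [(1 - \delta\eta)\lambda_t - \eta(\beta_t - c_0)]_+ \leq [(1 - \delta\eta)\lambda_t + \eta c_0]_+ .$$

By induction on $\lambda_t$, we can obtain $\lambda_t \leq c_0/\delta$. Applying
the standard analysis of online gradient descent [16] yields

$$|\lambda_{t+1} - \lambda|^2 = \big|\Pi_+[\lambda_t - \eta(\delta\lambda_t + \beta_t - c_0)] - \lambda\big|^2$$
$$\leq |\lambda_t - \lambda|^2 + |\eta(\delta\lambda_t - c_0) + \eta\beta_t|^2 - 2(\lambda_t - \lambda)(\eta \nabla g_t(\lambda_t))$$
$$\leq |\lambda_t - \lambda|^2 + 2\eta^2 c_0^2 + 2\eta^2 \beta_0^2 + 2\eta\,(g_t(\lambda) - g_t(\lambda_t)).$$

Then, by rearranging the terms we get

$$g_t(\lambda_t) - g_t(\lambda) \leq \frac{1}{2\eta}\big(|\lambda_t - \lambda|^2 - |\lambda_{t+1} - \lambda|^2\big) + \eta\,(c_0^2 + \beta_0^2).$$

Expanding the terms on the l.h.s. and taking the sum over $t$, we obtain the
inequality as desired.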
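The dual update in Lemma 2 is a projected gradient step on $g_t$, and it coincides with the closed form used in the algorithm. The sketch below checks this numerically and evaluates both sides of (7) for a given comparator $\lambda$. The function names and the choice of taking $\beta_0$ as the largest observed $\beta_t$ are assumptions made for this illustration.

```python
def g_gradient(lam, beta_t, c0, delta):
    """Gradient of g_t(lam) = (delta/2)*lam^2 + lam*(beta_t - c0)."""
    return delta * lam + beta_t - c0


def dual_step(lam, beta_t, c0, eta, delta):
    """Projected gradient step: [lam - eta * grad g_t(lam)]_+."""
    return max(0.0, lam - eta * g_gradient(lam, beta_t, c0, delta))


def dual_step_closed_form(lam, beta_t, c0, eta, delta):
    """Closed form used in the algorithm: [(1 - delta*eta)*lam - eta*(beta_t - c0)]_+."""
    return max(0.0, (1.0 - delta * eta) * lam - eta * (beta_t - c0))


def dual_inequality_gap(betas, lam_cmp, c0, eta, delta):
    """Left-hand side minus right-hand side of (7) for the comparator lam_cmp."""
    lam = 0.0  # lambda_1 = 0
    lhs = 0.0
    for beta_t in betas:
        lhs += (lam - lam_cmp) * (beta_t - c0)
        lhs += 0.5 * delta * (lam ** 2 - lam_cmp ** 2)
        lam = dual_step(lam, beta_t, c0, eta, delta)
    beta0 = max(betas)  # assumption: smallest valid upper bound on beta_t
    rhs = lam_cmp ** 2 / (2.0 * eta) + (c0 ** 2 + beta0 ** 2) * eta * len(betas)
    return lhs - rhs
```

A nonpositive gap for every comparator $\lambda \geq 0$ is what the lemma asserts; the sketch only illustrates this on sample sequences and is no substitute for the proof.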

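Proof. [of Theorem 1] Applying
$\mathbf{R}_t = \mathbf{r}_t + \lambda_t \mathbf{c}_t$ to the primal inequality
in Lemma 1, where
$\max(\|\mathbf{r}_t\|_\infty, \|\mathbf{c}_t\|_\infty) \leq 1$, we have

$$\sum_{t=1}^T (\mathbf{p} - \mathbf{p}_t)^\top (\mathbf{r}_t + \lambda_t \mathbf{c}_t) \leq \frac{\ln K}{\eta} + \frac{\eta}{4} T + \frac{\eta}{4} \sum_{t=1}^T \lambda_t^2 .$$

Applying $\beta_t = \mathbf{p}_t^\top \mathbf{c}_t$ to the dual inequality in
Lemma 2, where $\beta_t \leq 1$ and $c_0 \leq 1$, we have

$$\sum_{t=1}^T (\lambda_t - \lambda)(\mathbf{p}_t^\top \mathbf{c}_t - c_0) + \frac{\delta}{2} \sum_{t=1}^T (\lambda_t^2 - \lambda^2) \leq \frac{\lambda^2}{2\eta} + 2\eta T.$$

Combining the above two inequalities gives

$$\sum_{t=1}^T (\mathbf{p}^\top \mathbf{r}_t - \mathbf{p}_t^\top \mathbf{r}_t) + \sum_{t=1}^T \lambda (c_0 - \mathbf{p}_t^\top \mathbf{c}_t) - \Big(\frac{\delta T}{2} + \frac{1}{2\eta}\Big)\lambda^2 \leq \frac{\ln K}{\eta} + \frac{9\eta T}{4} + \Big(\frac{\eta}{4} - \frac{\delta}{2}\Big) \sum_{t=1}^T \lambda_t^2 + \sum_{t=1}^T \lambda_t (c_0 - \mathbf{p}^\top \mathbf{c}_t).$$

Taking expectation over $\mathbf{c}_t$, $t = 1, \ldots, T$, using
$\mathbb{E}[\mathbf{c}_t] = \mathbf{c}$ and noting that $\mathbf{p}_t$ and
$\lambda_t$ are independent of $\mathbf{c}_t$, we have

$$\mathbb{E}\Big[\sum_{t=1}^T (\mathbf{p}^\top \mathbf{r}_t - \mathbf{p}_t^\top \mathbf{r}_t) + \sum_{t=1}^T \lambda (c_0 - \mathbf{p}_t^\top \mathbf{c}) - \Big(\frac{\delta T}{2} + \frac{1}{2\eta}\Big)\lambda^2\Big] \leq \frac{\ln K}{\eta} + \frac{9}{4}\eta T + \mathbb{E}\Big[\Big(\frac{\eta}{4} - \frac{\delta}{2}\Big) \sum_{t=1}^T \lambda_t^2\Big] + \mathbb{E}\Big[\sum_{t=1}^T \lambda_t (c_0 - \mathbf{p}^\top \mathbf{c})\Big].$$

High Probability LEWA ($\eta$, $\delta$ and $\epsilon$)

initialize: $\mathbf{w}_1 = \mathbf{1}$ and $\lambda_1 = 0$
iterate t = 1, 2, ..., T
    Draw an action according to the probability $\mathbf{p}_t = \mathbf{w}_t / \sum_j w_j^t$
    Receive reward $\mathbf{r}_t$ and a realization of constraint $\mathbf{c}_t$
    Compute average constraint estimate $\hat{\mathbf{c}}_t = \frac{1}{t} \sum_{s=1}^t \mathbf{c}_s$
    Update $\mathbf{w}_{t+1} = \mathbf{w}_t \circ \exp(\eta(\mathbf{r}_t + \lambda_t \hat{\mathbf{c}}_t))$
    Update $\lambda_{t+1} = [(1 - \delta\eta)\lambda_t - \eta(\mathbf{p}_t^\top \hat{\mathbf{c}}_t + \alpha_t - c_0)]_+$
end iterate

Fig. 2. High Probability LEWA

Let $\mathbf{p}$ be the solution satisfying
$\mathbf{p}^\top \mathbf{c} \geq c_0$. Noting that
$\frac{\eta}{4} - \frac{\delta}{2} \leq 0$ and taking maximization over
$\lambda \geq 0$ on the l.h.s., we get

$$\mathbb{E}\Big[\max_{\mathbf{p}^\top \mathbf{c} \geq c_0} \sum_{t=1}^T \mathbf{p}^\top \mathbf{r}_t - \sum_{t=1}^T \mathbf{p}_t^\top \mathbf{r}_t\Big] + \mathbb{E}\Bigg[\frac{\big[\sum_{t=1}^T (c_0 - \mathbf{p}_t^\top \mathbf{c})\big]_+^2}{2(\delta T + 1/\eta)}\Bigg] \leq \frac{\ln K}{\eta} + \frac{9}{4}\eta T.$$

By plugging in the values of $\eta$ and $\delta$, noting that the above
inequality has the same structure as (3), and rewriting it in the forms (4) and
(5), we obtain the desired bounds for the regret and the violation of the
constraint in the long term.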
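Two arithmetic steps in the proof can be checked numerically. Reading the right-hand side of the last inequality as $\ln K/\eta + (9/4)\eta T$, the choice $\eta = \sqrt{4 \ln K/(9T)}$ balances the two terms and gives exactly $3\sqrt{T \ln K}$. Likewise, maximizing $\lambda a - (\delta T/2 + 1/(2\eta))\lambda^2$ over $\lambda \geq 0$ yields $[a]_+^2 / (2(\delta T + 1/\eta))$. The helper names below are hypothetical and serve only this illustration.

```python
import math


def regret_rhs(eta, horizon, num_actions):
    """Right-hand side ln(K)/eta + (9/4)*eta*T of the final inequality."""
    return math.log(num_actions) / eta + 2.25 * eta * horizon


def tuned_eta(horizon, num_actions):
    """Step size stated in Theorem 1: sqrt(4 ln K / (9 T))."""
    return math.sqrt(4.0 * math.log(num_actions) / (9.0 * horizon))


def dual_max(a, eta, delta, horizon):
    """Maximum over lam >= 0 of lam*a - (delta*T/2 + 1/(2*eta)) * lam^2."""
    b = delta * horizon / 2.0 + 1.0 / (2.0 * eta)
    lam_star = max(0.0, a / (2.0 * b))
    return lam_star * a - b * lam_star ** 2


def dual_max_closed_form(a, eta, delta, horizon):
    """Closed form [a]_+^2 / (2 * (delta*T + 1/eta))."""
    return max(a, 0.0) ** 2 / (2.0 * (delta * horizon + 1.0 / eta))
```

Any other step size makes `regret_rhs` larger, which is why the constant 3 in Theorem 1 cannot be improved by retuning $\eta$ within this analysis.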

Remark 1. We note that when deriving the bound for $\mathrm{Violation}_T$, we
simply use a weak lower bound on the regret, namely
$\mathrm{Regret}_T \geq -T$. It is possible to obtain an improved bound by
considering a tighter bound for $\mathrm{Regret}_T$. One way to do this is to
bound the regret by the variation of the reward vectors,
$\mathrm{Variation}_T = \sum_{t=1}^T \|\mathbf{r}_t - \bar{\mathbf{r}}_T\|_\infty$,
where $\bar{\mathbf{r}}_T = (1/T) \sum_{t=1}^T \mathbf{r}_t$ denotes the mean of
$\mathbf{r}_t$, $t \in [T]$. The analysis in Appendix A bounds the violation of
the constraint in terms of $\mathrm{Variation}_T$ as

$$\mathbb{E}\Big[\Big[\sum_{t=1}^T (c_0 - \mathbf{p}_t^\top \mathbf{c})\Big]_+\Big] \leq O(\sqrt{T}) + O\big(T^{1/4} \sqrt{\mathrm{Variation}_T}\big).$$

This bound is significantly better when the variation of the reward vectors is
small, and in the worst case it attains an $O(T^{3/4})$ bound similar to
Theorem 1.



4.1 A High Probability Bound 

The performance bounds proved in the previous section for the regret and the
violation of the constraint hold only in expectation, and the actual quantities
may fluctuate substantially around their means. Here, with a simple trick, we
present a modified version of the LEWA algorithm which attains similar bounds
with overwhelming probability. To this end, we slightly change the original LEWA
algorithm. More specifically, instead of using $\mathbf{c}_t$ in updating
$\lambda_{t+1}$, we use the average estimate and add a confidence bound to
achieve a more accurate estimation of the constraint vector $\mathbf{c}$. The
following theorem bounds the regret and the violation of the constraint with
high probability for the modified algorithm.

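Theorem 2. Let $\alpha_t = \sqrt{(1/2)\ln(2/\epsilon)}/\sqrt{t}$,
$\eta = O(T^{-1/2})$, and $\delta = \eta/2$. By running Algorithm 2 we have with
probability $1 - \epsilon$

$$\max_{\mathbf{p}^\top \mathbf{c} \geq c_0} \sum_{t=1}^T \mathbf{p}^\top \mathbf{r}_t - \sum_{t=1}^T \mathbf{p}_t^\top \mathbf{r}_t \leq \tilde{O}(T^{1/2}) \quad \text{and} \quad \Big[\sum_{t=1}^T (c_0 - \mathbf{p}_t^\top \mathbf{c})\Big]_+ \leq \tilde{O}(T^{3/4}),$$

where $\tilde{O}(\cdot)$ omits the log term in $T$.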
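The modification for the high probability version touches only the quantities fed to the updates: a running average of the observed constraint vectors and a confidence radius that shrinks with the round index. The sketch below illustrates these pieces. It assumes the confidence radius has the form $\alpha_t = \sqrt{\ln(2/\epsilon)/(2t)}$, which is our reading of the theorem statement, and the class and function names are hypothetical.

```python
import math


def confidence_radius(t, eps):
    """alpha_t = sqrt((1/2) * ln(2/eps) / t), for rounds t = 1, 2, ..."""
    return math.sqrt(0.5 * math.log(2.0 / eps) / t)


class RunningMean:
    """Maintains the average of the constraint realizations seen so far."""

    def __init__(self, num_actions):
        self.count = 0
        self.mean = [0.0] * num_actions

    def update(self, c_t):
        self.count += 1
        self.mean = [m + (c - m) / self.count for m, c in zip(self.mean, c_t)]
        return list(self.mean)


def hp_dual_step(lam, p_t, c_hat, alpha_t, c0, eta, delta):
    """Dual step with the averaged constraint and the confidence term added."""
    beta_t = sum(p * c for p, c in zip(p_t, c_hat)) + alpha_t
    return max(0.0, (1.0 - delta * eta) * lam - eta * (beta_t - c0))
```

Adding $\alpha_t$ makes the dual update treat the constraint as slightly better satisfied than the running average suggests, and the effect of this correction vanishes as $t$ grows since $\sum_t \alpha_t$ is of order $\sqrt{T}$ up to the logarithmic factor.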

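Proof. Applying $\mathbf{R}_t = \mathbf{r}_t + \lambda_t \hat{\mathbf{c}}_t$ to
the primal inequality in Lemma 1, where
$\max(\|\mathbf{r}_t\|_\infty, \|\hat{\mathbf{c}}_t\|_\infty) \leq 1$, we have

$$\sum_{t=1}^T (\mathbf{p} - \mathbf{p}_t)^\top (\mathbf{r}_t + \lambda_t \hat{\mathbf{c}}_t) \leq \frac{\ln K}{\eta} + \frac{\eta T}{4} + \frac{\eta}{4} \sum_{t=1}^T \lambda_t^2 .$$

Applying $\beta_t = \mathbf{p}_t^\top \hat{\mathbf{c}}_t + \alpha_t$ to the dual
inequality in Lemma 2, where $\beta_t \leq 1 + \alpha_1$ and $c_0 \leq 1$, we
have

$$\sum_{t=1}^T (\lambda_t - \lambda)(\mathbf{p}_t^\top \hat{\mathbf{c}}_t + \alpha_t - c_0) + \frac{\delta}{2} \sum_{t=1}^T (\lambda_t^2 - \lambda^2) \leq \frac{\lambda^2}{2\eta} + [1 + (1 + \alpha_1)^2]\,\eta T.$$

Combining the above two inequalities and using
$(1 + \alpha_1)^2 \leq 2 + 2\alpha_1^2$ results in

$$\sum_{t=1}^T (\mathbf{p}^\top \mathbf{r}_t - \mathbf{p}_t^\top \mathbf{r}_t) + \sum_{t=1}^T \lambda (c_0 - \mathbf{p}_t^\top \hat{\mathbf{c}}_t - \alpha_t) - \Big(\frac{\delta T}{2} + \frac{1}{2\eta}\Big)\lambda^2 \leq \frac{\ln K}{\eta} + \Big(\frac{13}{4} + 2\alpha_1^2\Big)\eta T + \Big(\frac{\eta}{4} - \frac{\delta}{2}\Big) \sum_{t=1}^T \lambda_t^2 + \sum_{t=1}^T \lambda_t (c_0 - \mathbf{p}^\top \hat{\mathbf{c}}_t - \alpha_t).$$

Let $\mathbf{p}$ be the solution satisfying
$\mathbf{p}^\top \mathbf{c} \geq c_0$. Note that
$\frac{\eta}{4} - \frac{\delta}{2} \leq 0$ and that, with probability
$1 - \epsilon$,

$$|\mathbf{p}^\top \mathbf{c} - \mathbf{p}^\top \hat{\mathbf{c}}_t| \leq \alpha_t,$$

which is due to Hoeffding's inequality [17]. By taking maximization over
$\lambda \geq 0$ on the l.h.s., we have with probability $1 - \epsilon T$,

$$\max_{\mathbf{p}^\top \mathbf{c} \geq c_0} \sum_{t=1}^T \mathbf{p}^\top \mathbf{r}_t - \sum_{t=1}^T \mathbf{p}_t^\top \mathbf{r}_t + \frac{\big[\sum_{t=1}^T (c_0 - \mathbf{p}_t^\top \hat{\mathbf{c}}_t - \alpha_t)\big]_+^2}{2(\delta T + 1/\eta)} \leq \frac{\ln K}{\eta} + \Big(\frac{13}{4} + 2\alpha_1^2\Big)\eta T.$$

Plugging in the stated values of $\eta$ and $\delta$, we have, with probability
$1 - \epsilon T$,

$$\max_{\mathbf{p}^\top \mathbf{c} \geq c_0} \sum_{t=1}^T \mathbf{p}^\top \mathbf{r}_t - \sum_{t=1}^T \mathbf{p}_t^\top \mathbf{r}_t \leq O\big(T^{1/2} \ln(1/\epsilon)\big),$$

$$\Big[\sum_{t=1}^T (c_0 - \mathbf{p}_t^\top \mathbf{c})\Big]_+ \leq \sqrt{\big(T + T^{1/2} \ln(1/\epsilon)\big)\, T^{1/2}} + \sum_{t=1}^T (\mathbf{p}_t^\top \hat{\mathbf{c}}_t + \alpha_t - \mathbf{p}_t^\top \mathbf{c}) \leq O(T^{3/4}) + 2 \sum_{t=1}^T \alpha_t \leq O(T^{3/4}) + O\big(T^{1/2} \ln(1/\epsilon)\big).$$

By replacing $\epsilon$ with $\epsilon/T$ and noting that
$O(T^{1/2} \ln T) \leq O(T^{3/4})$, we obtain the results stated in the theorem.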


5 Bandit Constrained Regret Minimization 

In this section, we generalize our results to the bandit setting for both
rewards and constraints. In the bandit setting, at each iteration we are
required to choose an action $i_t$ from the pool of actions $[K]$. Then only the
reward and the constraint feedback for action $i_t$ are revealed to the learner,
i.e., $r_{i_t}^t$ and $c_{i_t}^t$. In this case, we are interested in the regret
bound
$\max_{\mathbf{p}^\top \mathbf{c} \geq c_0} \sum_{t=1}^T \mathbf{p}^\top \mathbf{r}_t - \sum_{t=1}^T r_{i_t}^t$.
In the classical setting, i.e., without constraint, this problem can be solved
in the stochastic and adversarial settings by the UCB and Exp3 algorithms
proposed in [18] and [19], respectively. The proposed algorithm, BanditLEWA, is
shown in Fig. 3; it uses an idea similar to Exp3 for exploration and
exploitation.

Before presenting the performance bounds of the algorithm, let us introduce two
vectors: $\hat{\mathbf{r}}_t$ is the all-zero vector except in its $i_t$th
component, which is set to $\hat{r}_{i_t}^t = r_{i_t}^t / p_{i_t}^t$, and
similarly $\hat{\mathbf{c}}_t$ is the all-zero vector except in its $i_t$th
component, which is set to $\hat{c}_{i_t}^t = c_{i_t}^t / p_{i_t}^t$. It is easy
to verify that $\mathbb{E}_{i_t}[\hat{\mathbf{r}}_t] = \mathbf{r}_t$ and
$\mathbb{E}_{i_t}[\hat{\mathbf{c}}_t] = \mathbf{c}_t$. The following theorem
shows that the BanditLEWA algorithm achieves an $O(T^{3/4})$ regret bound and an
$O(T^{3/4})$ bound on the violation of the constraint in expectation.
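The unbiasedness of the importance-weighted vectors can be verified by computing the expectation over the sampled action exactly. The sketch below does this for an arbitrary feedback vector and sampling distribution; the function names are hypothetical and the same construction applies to both rewards and constraints.

```python
def importance_weighted(num_actions, action, value, prob):
    """All-zero vector except for value / prob in the sampled coordinate."""
    estimate = [0.0] * num_actions
    estimate[action] = value / prob
    return estimate


def expected_estimate(values, p):
    """Exact expectation of the estimator when the action is drawn from p."""
    num_actions = len(values)
    expectation = [0.0] * num_actions
    for i in range(num_actions):
        estimate = importance_weighted(num_actions, i, values[i], p[i])
        for j in range(num_actions):
            expectation[j] += p[i] * estimate[j]
    return expectation
```

The price of unbiasedness is variance: the nonzero entry can be as large as $1/p_{i_t}^t$, which is why the bandit algorithm mixes in uniform exploration to keep every probability bounded away from zero.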



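Theorem 3. Let $\gamma = O(T^{-1/2})$ and
$\eta = \frac{\gamma}{K} \cdot \frac{\delta}{\delta + 1}$. By running the
BanditLEWA algorithm, we have

$$\max_{\mathbf{p}^\top \mathbf{c} \geq c_0} \sum_{t=1}^T \mathbf{p}^\top \mathbf{r}_t - \mathbb{E}\Big[\sum_{t=1}^T \mathbf{p}_t^\top \mathbf{r}_t\Big] \leq O(T^{3/4}) \quad \text{and} \quad \mathbb{E}\Big[\Big[\sum_{t=1}^T (c_0 - \mathbf{p}_t^\top \mathbf{c})\Big]_+\Big] \leq O(T^{3/4}).$$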



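BanditLEWA ($\eta$, $\gamma$, and $\delta$)

initialize: $\mathbf{w}_1 = \mathbf{1}$ and $\lambda_1 = 0$
iterate t = 1, 2, ..., T
    Set $\mathbf{q}_t = \mathbf{w}_t / \sum_j w_j^t$
    Draw action $i_t$ randomly according to $\mathbf{p}_t = (1 - \gamma)\mathbf{q}_t + \frac{\gamma}{K}\mathbf{1}$
    Receive reward $r_{i_t}^t$ and a realization of constraint $c_{i_t}^t$ for action $i_t$
    Update $w_i^{t+1} = w_i^t \exp(\eta(\hat{r}_i^t + \lambda_t \hat{c}_i^t))$
    Update $\lambda_{t+1} = [(1 - \delta\eta)\lambda_t - \eta(\mathbf{q}_t^\top \hat{\mathbf{c}}_t - c_0)]_+$
end iterate

Fig. 3. Constrained regret minimization with partial (bandit) feedback about
reward and constraint vectors

Proof. In order to have an improved analysis, we first derive an improved primal
inequality and an improved dual inequality. Let
$\hat{\mathbf{R}}_t = \hat{\mathbf{r}}_t + \lambda_t \hat{\mathbf{c}}_t$. By
following the analysis for the Exp3 algorithm [19], we have

$$\eta \sum_{t=1}^T (\mathbf{p} - \mathbf{q}_t)^\top \hat{\mathbf{R}}_t \leq \ln K + \eta^2 \sum_{t=1}^T \mathbf{q}_t^\top (\hat{\mathbf{R}}_t)^2 . \tag{8}$$

Dividing both sides by $\eta$, and taking expectation we get

$$\mathbb{E}\Big[\sum_{t=1}^T (\mathbf{p} - \mathbf{q}_t)^\top (\mathbf{r}_t + \lambda_t \mathbf{c}_t)\Big] \leq \frac{\ln K}{\eta} + \eta\, \mathbb{E}\Big[\sum_{t=1}^T \mathbf{q}_t^\top (\hat{\mathbf{R}}_t)^2\Big]$$
$$\leq \frac{\ln K}{\eta} + \eta\, \mathbb{E}\Big[\sum_{t=1}^T 2\mathbf{q}_t^\top (\hat{\mathbf{r}}_t)^2 + 2\lambda_t^2 \mathbf{q}_t^\top (\hat{\mathbf{c}}_t)^2\Big]$$
$$\leq \frac{\ln K}{\eta} + \frac{2\eta K T}{1 - \gamma} + \frac{2\eta K}{1 - \gamma}\, \mathbb{E}\Big[\sum_{t=1}^T \lambda_t^2\Big], \tag{9}$$

where the third inequality follows from the following inequality

$$\mathbb{E}[\mathbf{q}_t^\top (\hat{\mathbf{c}}_t)^2] = \mathbb{E}\Big[q_{i_t}^t \frac{(c_{i_t}^t)^2}{(p_{i_t}^t)^2}\Big] \leq \frac{1}{1 - \gamma}\, \mathbb{E}\Big[\frac{(c_{i_t}^t)^2}{p_{i_t}^t}\Big] \leq \frac{1}{1 - \gamma}\, \mathbb{E}\Big[\frac{1}{p_{i_t}^t}\Big] = \frac{K}{1 - \gamma}, \tag{10}$$

and the same inequality holds for
$\mathbb{E}[\mathbf{q}_t^\top (\hat{\mathbf{r}}_t)^2]$. Next, we let
$g_t(\lambda) = \frac{\delta}{2}\lambda^2 + \lambda(\mathbf{q}_t^\top \hat{\mathbf{c}}_t - c_0)$.
By following an analysis similar to the proof of Lemma 2, we have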

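$$g_t(\lambda_t) - g_t(\lambda) \leq \frac{1}{2\eta}\big(|\lambda - \lambda_t|^2 - |\lambda - \lambda_{t+1}|^2\big) + \frac{\eta}{2}\,|\nabla g_t(\lambda_t)|^2$$
$$\leq \frac{1}{2\eta}\big(|\lambda - \lambda_t|^2 - |\lambda - \lambda_{t+1}|^2\big) + \eta\,(\mathbf{q}_t^\top \hat{\mathbf{c}}_t)^2 + \eta.$$

Taking summation and expectation, we have

$$\mathbb{E}\Big[\sum_{t=1}^T g_t(\lambda_t) - g_t(\lambda)\Big] \leq \frac{\lambda^2}{2\eta} + \eta\, \mathbb{E}\Big[\sum_{t=1}^T (\mathbf{q}_t^\top \hat{\mathbf{c}}_t)^2\Big] + \eta T \leq \frac{\lambda^2}{2\eta} + \frac{2\eta K T}{1 - \gamma}. \tag{11}$$

Combining equations (11) and (9) gives

$$\mathbb{E}\Big[\sum_{t=1}^T (\mathbf{p} - \mathbf{q}_t)^\top \mathbf{r}_t\Big] + \mathbb{E}\Big[\sum_{t=1}^T \lambda(c_0 - \mathbf{q}_t^\top \mathbf{c}) - \Big(\frac{\delta T}{2} + \frac{1}{2\eta}\Big)\lambda^2\Big] \leq \frac{\ln K}{\eta} + \frac{4\eta K T}{1 - \gamma} + \Big(\frac{2\eta K}{1 - \gamma} - \frac{\delta}{2}\Big)\mathbb{E}\Big[\sum_{t=1}^T \lambda_t^2\Big] + \mathbb{E}\Big[\sum_{t=1}^T \lambda_t (c_0 - \mathbf{p}^\top \mathbf{c})\Big].$$

Noting that $(1 - \gamma)\mathbf{q}_t \leq \mathbf{p}_t$, we get

$$\mathbb{E}\Big[\sum_{t=1}^T (1 - \gamma)\mathbf{p}^\top \mathbf{r}_t - \mathbf{p}_t^\top \mathbf{r}_t\Big] + \mathbb{E}\Big[\sum_{t=1}^T \lambda\big((1 - \gamma)c_0 - \mathbf{p}_t^\top \mathbf{c}\big) - \Big(\frac{\delta T}{2} + \frac{1}{2\eta}\Big)\lambda^2\Big] \leq \frac{\ln K}{\eta} + 4\eta K T + \Big(2\eta K - (1 - \gamma)\frac{\delta}{2}\Big)\mathbb{E}\Big[\sum_{t=1}^T \lambda_t^2\Big] + (1 - \gamma)\,\mathbb{E}\Big[\sum_{t=1}^T \lambda_t (c_0 - \mathbf{p}^\top \mathbf{c})\Big].$$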

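Let $\mathbf{p}^\top \mathbf{c} \geq c_0$ and
$2\eta K \leq (1 - \gamma)\delta/2$. By taking maximization over $\lambda$, we
have

$$\mathbb{E}\Big[\max_{\mathbf{p}^\top \mathbf{c} \geq c_0} \sum_{t=1}^T \mathbf{p}^\top \mathbf{r}_t - \mathbf{p}_t^\top \mathbf{r}_t\Big] + \mathbb{E}\Bigg[\frac{\big[\sum_{t=1}^T ((1 - \gamma)c_0 - \mathbf{p}_t^\top \mathbf{c})\big]_+^2}{2(\delta T + 1/\eta)}\Bigg]$$
$$\leq \frac{\ln K}{\eta} + 4\eta K T + \gamma T = \frac{K(\delta + 1)\ln K}{\delta\gamma} + \frac{4\gamma\delta}{\delta + 1} T + \gamma T$$
$$= \frac{K(\delta + 1)\ln K}{\delta\gamma} + \frac{5\delta + 1}{\delta + 1}\gamma T \leq O\Bigg(\sqrt{\frac{(5\delta + 1) K \ln K}{\delta}\, T}\Bigg).$$

Then we obtain

$$\max_{\mathbf{p}^\top \mathbf{c} \geq c_0} \sum_{t=1}^T \mathbf{p}^\top \mathbf{r}_t - \mathbb{E}\Big[\sum_{t=1}^T \mathbf{p}_t^\top \mathbf{r}_t\Big] \leq O\Bigg(\sqrt{\frac{(5\delta + 1) K \ln K}{\delta}\, T}\Bigg),$$

$$\mathbb{E}\Big[\Big[\sum_{t=1}^T (c_0 - \mathbf{p}_t^\top \mathbf{c})\Big]_+\Big] \leq \sqrt{\Bigg(T + O\Bigg(\sqrt{\frac{(5\delta + 1) K \ln K}{\delta}\, T}\Bigg)\Bigg)\, 2(\delta T + 1/\eta)} + \gamma T.$$

Let $\gamma = O(T^{-1/2})$ and $\delta = O(T^{-1/2})$; then we get the
$O(T^{3/4})$ regret and $O(T^{3/4})$ constraint bounds as claimed.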
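For concreteness, the BanditLEWA loop analyzed above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the callbacks `reward_fn(t, i)` and `constraint_fn(t, i)` are hypothetical stand-ins for the environment and return feedback only for the chosen action, the weights are kept in log-space for numerical stability, and the dual step uses the factor $(1 - \delta\eta)$, consistent with the function $g_t$ used in the proof.

```python
import math
import random


def bandit_lewa(reward_fn, constraint_fn, num_actions, horizon,
                c0, eta, gamma, delta, seed=0):
    """BanditLEWA sketch; returns a list of (p_t, i_t, lambda_t) per round."""
    rng = random.Random(seed)  # hypothetical seeding, for reproducibility only
    log_w = [0.0] * num_actions  # log of w_1 = 1
    lam = 0.0                    # lambda_1 = 0
    history = []
    for t in range(horizon):
        shift = max(log_w)
        w = [math.exp(v - shift) for v in log_w]
        total = sum(w)
        q_t = [v / total for v in w]
        # Mix with the uniform distribution so every action keeps mass gamma/K.
        p_t = [(1.0 - gamma) * q + gamma / num_actions for q in q_t]
        i_t = rng.choices(range(num_actions), weights=p_t)[0]
        history.append((p_t, i_t, lam))
        # Importance-weighted estimates; all other coordinates are zero.
        r_hat = reward_fn(t, i_t) / p_t[i_t]
        c_hat = constraint_fn(t, i_t) / p_t[i_t]
        # Primal step touches only the played coordinate.
        log_w[i_t] += eta * (r_hat + lam * c_hat)
        # Dual step uses q_t^T c_hat_t, which reduces to a single term.
        lam = max(0.0, (1.0 - delta * eta) * lam - eta * (q_t[i_t] * c_hat - c0))
    return history
```

Because only one coordinate of the estimates is nonzero, each round costs $O(K)$ for the normalization and $O(1)$ for the updates themselves.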

As our previous results, we present an algorithm with a high probability bound 
on the regret and the violation of the constraint. For ease of exposition, we intro- 
duce c t = j J2 s =i Cs an( l = jY^ s =i^ s - We m °dify BanditLEWA algorithm 
High Probability BanditLEWA ($\eta$, $\gamma$, $\delta$, and $\epsilon$)

initialize: $w_1 = \exp(\eta\alpha\sqrt{KT})\,\mathbf{1}$ and $\lambda_1 = 0$, where $\alpha = 2\sqrt{\ln(4KT/\epsilon)}$
iterate $t = 1, 2, \ldots, T$

Set $q_t = w_t / \sum_j w_t^j$

Set $p_t = (1-\gamma)q_t + \gamma/K$

Draw action $i_t$ randomly according to the probabilities $p_t$
Receive reward $r_t^{i_t}$ and a realization of constraint $c_t^{i_t}$ for action $i_t$
Update Wt+i by 

2K qi 



Update A i+ i 
end iterate 



[(!• 



exp 77 \r l + 



5r;)A t - ry(x t ' c t + a t - c )]- 



At c, H -= 

7 \A 



so that it uses more accurate estimates, rather than estimates that are only correct in expectation, when updating the primal and dual variables. To this end, we use upper confidence bounds for the rewards, as in the Exp3.P algorithm [18], and for the constraint vector $c$. The following theorem states the regret bound and the long-term violation of the constraint for the high probability BanditLEWA.
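To make the loop structure concrete, here is a minimal sketch of a BanditLEWA-style round: mix the exponential weights with uniform exploration, sample an action, form importance-weighted estimates, and update the weights and the multiplier. It deliberately omits the confidence terms ($\alpha$, $\alpha_t$) of the high probability variant, so it is not that algorithm; the function name, argument layout, and the log-domain weights are choices made for this illustration.

```python
import math
import random


def bandit_lewa_sketch(rewards, constraints, c0, eta, gamma, delta, seed=0):
    """Simplified primal-dual exponential weights with bandit feedback.

    rewards[t][i], constraints[t][i] are values in [0, 1]; only the entry of the
    played action is used in round t. Returns (played actions, final lambda).
    """
    rng = random.Random(seed)
    K = len(rewards[0])
    log_w = [0.0] * K          # weights kept in the log domain for stability
    lam = 0.0                  # dual variable
    actions = []
    for r_t, c_t in zip(rewards, constraints):
        m = max(log_w)
        w = [math.exp(v - m) for v in log_w]
        total = sum(w)
        q = [wi / total for wi in w]
        p = [(1 - gamma) * qi + gamma / K for qi in q]   # uniform exploration
        i = rng.choices(range(K), weights=p)[0]
        actions.append(i)
        # Importance-weighted estimates; all other coordinates are zero.
        r_hat = r_t[i] / p[i]
        c_hat = c_t[i] / p[i]
        # Dual step: lambda <- [(1 - delta*eta)*lambda - eta*(q^T c_hat - c0)]_+
        new_lam = max(0.0, (1 - delta * eta) * lam - eta * (q[i] * c_hat - c0))
        # Primal step uses the current lambda_t, not the updated one.
        log_w[i] += eta * (r_hat + lam * c_hat)
        lam = new_lam
    return actions, lam


# Hypothetical two-armed instance: arm 0 pays more but never meets the constraint.
T = 200
acts, lam = bandit_lewa_sketch([[1.0, 0.5]] * T, [[0.0, 1.0]] * T,
                               c0=0.5, eta=0.05, gamma=0.1, delta=0.05)
print(sum(acts) / T, lam)
```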

c 

Theorem 4. Let a t = y/{l/2) ln(6KT/e)/V~t, 7 = O^ 172 ),^ = Tj^^p 

and $\alpha = 2\sqrt{\ln(4KT/\epsilon)}$, where $\beta = \max\{3, 1+2\alpha_1\}$. By running High Probability BanditLEWA, we have with probability $1-\epsilon$



max p T V r t - V r\ < 0(T 3 / 4 /^) and 



T 

^(c - p^c) 
t=i j + 



< O(VST). 



The proof is deferred to Appendix B. From this theorem, when $\delta = O(T^{-1/4})$, the regret and the violation bounds are both $O(T^{7/8})$.
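The choice of $\delta$ balances the two rates. As a quick check, taking the two bounds to scale as $T^{3/4}/\sqrt{\delta}$ and $\sqrt{\delta}\,T$ (our reading of the theorem), the exponents of $T$ can be computed exactly; the helper name `exponents` is illustrative.

```python
from fractions import Fraction


def exponents(delta_exp):
    """Exponents of T in the two bounds when delta = T**delta_exp.

    regret    ~ T^{3/4} / sqrt(delta)  ->  3/4 - delta_exp / 2
    violation ~ sqrt(delta) * T        ->  1   + delta_exp / 2
    """
    regret = Fraction(3, 4) - delta_exp / 2
    violation = 1 + delta_exp / 2
    return regret, violation


print(exponents(Fraction(-1, 4)))   # both exponents equal 7/8
```

Any other exponent for $\delta$ makes one of the two rates worse than $T^{7/8}$, since the first exponent decreases and the second increases with `delta_exp`.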



6 Conclusions and Future Works 



In this paper we proposed an efficient algorithm for regret minimization under stochastic constraints. The proposed algorithm, LEWA, is a primal-dual variant of the exponentially weighted average algorithm and relies on the theory of Lagrangian methods in constrained optimization. We established expected and high probability bounds on the regret and the long-term violation of the constraint in the full information and bandit settings using novel theoretical analysis. In particular, in the full information setting, the LEWA algorithm attains the optimal $O(\sqrt{T})$ regret bound and an $O(T^{3/4})$ bound on the violation of the constraint in expectation, and, with a simple trick, in high probability.

The present work leaves open a number of interesting directions for future work. In particular, extending the framework to handle multi-criteria online decision making is left to future work. Turning the proposed algorithm into one that exactly satisfies the constraint in the long run is also an interesting problem. Finally, it would be interesting to see whether the bound obtained for the violation of the constraint can be improved.
A Variation Bound for Violation of the Constraint

Previously, when deriving the bound for the violation of the constraint, we simply lower bounded the regret as $\sum_{t=1}^{T}(p^\top r_t - p_t^\top r_t) \ge -T$. Since this simple lower bound seems weak in general, we present a variation-based bound for the violation of the constraint, which results in significantly improved bounds when the variation of the consecutive reward vectors is small. For example, when the reward vectors are correlated, the variation will be smaller than $T$. We note that bounding the regret in terms of the variation of the reward vectors has been investigated in a few recent works [20,21], and online learning algorithms with improved regret bounds have been developed. To this end, let $\bar r_T = \frac{1}{T}\sum_{t=1}^{T} r_t$ denote the mean of the reward vectors $r_t$, $t = 1, \ldots, T$, and define the variation in the reward vectors as

\[
\mathrm{Variation}_T = \sum_{t=1}^{T} \|r_t - \bar r_T\|.
\]
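As an illustration of this definition, the sketch below computes the variation of a reward sequence. The norm is assumed to be the sup norm, which is one choice compatible with the bounds that follow; the function name is illustrative.

```python
def variation(rewards):
    """Sum over rounds of the sup-norm distance between r_t and the mean reward.

    rewards: list of T reward vectors, each a list of K floats.
    The sup norm is an assumption made for this sketch.
    """
    T = len(rewards)
    K = len(rewards[0])
    mean = [sum(r[i] for r in rewards) / T for i in range(K)]
    return sum(max(abs(r[i] - mean[i]) for i in range(K)) for r in rewards)


print(variation([[1.0, 0.0], [0.0, 1.0]]))   # alternating rewards
print(variation([[0.3, 0.7]] * 5))           # constant rewards: essentially zero
```

With rewards in $[0,1]$ each term is at most 1, so the variation never exceeds $T$, and it is close to zero when the reward vectors barely change.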



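Then we have

\[
\sum_{t=1}^{T}(p_t^\top r_t - p^\top r_t)
= \sum_{t=1}^{T}\Big[p_t^\top(r_t - \bar r_T) + (p_t^\top \bar r_T - p^\top \bar r_T) + p^\top(\bar r_T - r_t)\Big]
\]
\[
\le 2\,\mathrm{Variation}_T + \sum_{t=1}^{T}(p_t^\top \bar r_T - p^\top \bar r_T)
= 2\,\mathrm{Variation}_T + T(\bar p_T^\top \bar r_T - p^\top \bar r_T),
\]

where $\bar p_T = \frac{1}{T}\sum_{t=1}^{T} p_t$. The following lemma bounds the second term in the above inequality.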
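The decomposition can be sanity-checked on random data. The sketch below draws arbitrary distributions $p_t$, a comparator $p$ and rewards in $[0,1]$ (all synthetic), uses the sup norm for the variation (an assumption), and compares both sides of the bound.

```python
import random


def check_decomposition(T=50, K=4, seed=1):
    """Return (lhs, rhs) of
    sum_t (p_t^T r_t - p^T r_t) <= 2 * Variation_T + T * (p_bar^T r_bar - p^T r_bar)
    on random data."""
    rng = random.Random(seed)

    def simplex_point():
        x = [rng.random() for _ in range(K)]
        s = sum(x)
        return [v / s for v in x]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    rewards = [[rng.random() for _ in range(K)] for _ in range(T)]
    plays = [simplex_point() for _ in range(T)]     # the p_t
    p = simplex_point()                             # fixed comparator
    r_bar = [sum(r[i] for r in rewards) / T for i in range(K)]
    p_bar = [sum(x[i] for x in plays) / T for i in range(K)]
    var = sum(max(abs(r[i] - r_bar[i]) for i in range(K)) for r in rewards)
    lhs = sum(dot(pt, rt) - dot(p, rt) for pt, rt in zip(plays, rewards))
    rhs = 2 * var + T * (dot(p_bar, r_bar) - dot(p, r_bar))
    return lhs, rhs


lhs, rhs = check_decomposition()
print(lhs, rhs)
```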

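Lemma 3. Let $p = \arg\max_{x\in\Delta,\ x^\top c \ge c_0} x^\top \bar r_T$, then

\[
\bar p_T^\top \bar r_T - p^\top \bar r_T \le \frac{C}{T}\Big[\sum_{t=1}^{T}(c_0 - p_t^\top c)\Big]_+,
\]

where $C$ is some constant and $\Delta = \{\alpha \in \mathbb{R}_+^K : \sum_{i=1}^{K}\alpha_i = 1\}$ is the simplex.

Proof. Let $h(\gamma)$ denote

\[
h(\gamma) = \max_{x\in\Delta} x^\top \bar r_T, \quad \text{s.t. } c_0 - x^\top c \le \gamma.
\]

We assume $c_0 - \bar p_T^\top c \ge 0$, otherwise the bound is trivial. Then

\[
\bar p_T^\top \bar r_T - p^\top \bar r_T \le h(c_0 - \bar p_T^\top c) - h(0).
\]

Introducing a Lagrangian multiplier,

\[
h(\gamma) = \min_{\mu \ge 0}\max_{x\in\Delta}\ x^\top \bar r_T + \mu(\gamma - c_0 + x^\top c)
= \min_{\mu \ge 0}\max_{x\in\Delta}\ x^\top(\bar r_T + \mu c) - \mu c_0 + \mu\gamma.
\]

For each fixed $\mu$ the objective is linear, hence concave, in $\gamma$, and a pointwise minimum of concave functions is concave; therefore $h(\gamma)$ is also a concave function of $\gamma$. Then we have

\[
\bar p_T^\top \bar r_T - p^\top \bar r_T \le h(c_0 - \bar p_T^\top c) - h(0)
\le h'(0)(c_0 - \bar p_T^\top c)
\le \frac{h'(0)}{T}\Big[\sum_{t=1}^{T}(c_0 - p_t^\top c)\Big]_+,
\]

where the last inequality follows from the fact that $h(\gamma)$ is a monotonically increasing function, i.e., $h'(0) \ge 0$.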
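The concavity and monotonicity of $h$ can be observed on a small instance. Since the feasible set is the simplex cut by one linear constraint, its vertices are either feasible unit vectors or points on an edge of the simplex where the constraint is tight, so $h(\gamma)$ can be evaluated by enumeration. This is only a sketch for small $K$ with made-up data, not a general LP solver.

```python
def h(gamma, r, c, c0):
    """Value of max x^T r over the simplex subject to c0 - x^T c <= gamma,
    computed by enumerating candidate vertices of the feasible polytope."""
    b = c0 - gamma                    # constraint reads x^T c >= b
    K = len(r)
    best = float("-inf")              # stays -inf if the problem is infeasible
    for i in range(K):
        if c[i] >= b:
            best = max(best, r[i])    # unit vector e_i is feasible
            for j in range(K):
                if c[j] < b:
                    # Point on the edge e_i - e_j where the constraint is tight.
                    theta = (b - c[j]) / (c[i] - c[j])
                    best = max(best, theta * r[i] + (1 - theta) * r[j])
    return best


r = [1.0, 0.2]   # hypothetical mean rewards
c = [0.0, 1.0]   # hypothetical constraint vector
c0 = 0.5
values = [h(g, r, c, c0) for g in (0.0, 0.25, 0.5)]
print(values)
```

On this instance $h$ is linear between $\gamma = 0$ and $\gamma = 0.5$, so the midpoint value matches the average of the endpoints and the slope at zero plays the role of $h'(0)$.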

From the proof, the condition in Lemma 3 holds if $h'(0)$ exists. In order to show that $h'(0)$ exists, we need to show that the linear system

\[
\max_{x}\ x^\top r, \quad \text{s.t. } c_0 - c^\top x \le 0,\ x^\top \mathbf{1} = 1,\ x \ge 0 \tag{12}
\]

and its dual satisfy the regular condition. In order to represent the above linear programming problem in a standard form, we let $A = (-c, \mathbf{1}, -\mathbf{1})^\top$ and $u = (-c_0, 1, -1)^\top$, and rewrite the linear system in (12) as

\[
\max_{x}\ x^\top r \quad \text{s.t. } Ax \le u,\ x \ge 0,
\]

and its dual problem as

\[
\min_{y}\ y^\top u \quad \text{s.t. } y \ge 0,\ A^\top y \ge r.
\]

To show that the system satisfies the regular condition, we need to show that

\[
x \gneq 0,\ Ax \le 0 \ \Rightarrow\ x^\top r < 0, \tag{13}
\]
\[
y \gneq 0,\ A^\top y \ge 0 \ \Rightarrow\ y^\top u > 0, \tag{14}
\]

where $\gneq 0$ denotes a nonnegative vector with at least one positive element, which is also termed semipositive. To prove (13), note that there does not exist any semipositive vector $x$ such that $x^\top \mathbf{1} = 0$. Therefore the primal system satisfies the regular condition vacuously. Although the primal condition holds only vacuously, the dual system still satisfies the regular condition as long as $c_0 < c_{\max}$. The gradient $h'(0)$ is actually the Lagrangian variable when $\gamma = 0$. The following lemma verifies the existence of $h'(0)$.

Lemma 4. Let $x \ge 0$, $x^\top \mathbf{1} = 1$, $x^\top c \ge c_0$ be strictly feasible, or $c_0 < \max_k c_k$; then there exists a bounded gradient $h'(0)$.

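Following Lemma 3, we have

\[
\sum_{t=1}^{T}(p_t^\top r_t - p^\top r_t) \le 2\,\mathrm{Variation}_T + C\Big[\sum_{t=1}^{T}(c_0 - p_t^\top c)\Big]_+.
\]

Then we have

\[
\Big[\sum_{t=1}^{T}(c_0 - p_t^\top c)\Big]_+^2 \le O(\sqrt{T})\Big(2\,\mathrm{Variation}_T + C\Big[\sum_{t=1}^{T}(c_0 - p_t^\top c)\Big]_+ + O(\sqrt{T})\Big).
\]

Then we get

\[
\Big[\sum_{t=1}^{T}(c_0 - p_t^\top c)\Big]_+ \le O(\sqrt{T}) + O\big(T^{1/4}\sqrt{\mathrm{Variation}_T}\big).
\]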
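The last step is an instance of a standard fact about quadratic inequalities. As a short worked derivation, assume the preceding bound has the form $X^2 \le a + bX$ with $X = \big[\sum_{t=1}^{T}(c_0 - p_t^\top c)\big]_+$, $b = O(\sqrt{T})$ and $a = O(T) + O(\sqrt{T}\,\mathrm{Variation}_T)$; the symbols $X$, $a$, $b$ are introduced here only for illustration.

```latex
\[
X^2 \le a + bX,\quad X, a, b \ge 0
\ \Longrightarrow\
X \le \frac{b + \sqrt{b^2 + 4a}}{2}
\le \frac{b + b + 2\sqrt{a}}{2}
= b + \sqrt{a},
\]
\[
\sqrt{a} \le \sqrt{O(T)} + \sqrt{O(\sqrt{T}\,\mathrm{Variation}_T)}
= O(\sqrt{T}) + O\big(T^{1/4}\sqrt{\mathrm{Variation}_T}\big).
\]
```

Both steps use $\sqrt{u+v} \le \sqrt{u} + \sqrt{v}$ for nonnegative $u, v$, and together they give the stated $O(\sqrt{T}) + O(T^{1/4}\sqrt{\mathrm{Variation}_T})$ rate.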



B Proof of Theorem 4 

Similar to the analysis of the Exp3.P algorithm in [18], we have the following two upper confidence bounds,



E 



E^ + c^E'iv* 
t=i *=i 

c1+ I >E C *- V * 



7 \/i 



(15) 
(16) 



where $\hat\sigma_i = \sqrt{KT} + \sum_{t=1}^{T}\frac{1}{p_t^i\sqrt{KT}}$. Following the same line of proof as in [18], we have



1=1 



t=i 



j 5 



' t=l i 



A?c t ' 1 



7(1 — 7) 7 2 (5 2 



(l + ln(T)) 
and
T 



t=i 



2K 



v + acjt i + A *^ + A * — a *) - p T H( rt + A * 5 *) 
\t=i * *=i / 



Then we have



t / t 2k T \ 

P T 51 ( r * + A * S *) + + afT > + At ^ + A * — a *) _ P T zO r * + AtS *) - 1*"^' + A * H *) 

t=i \t=i ' t=i / 



< 



4// 



1 — 7 7(5 

4a 2 ?y l^a 2 ?^ 
7(1 — 7) 7 2 i5 2 



7 

(1 + lnT) 



t=l i 

In A 



On the other hand, let $g_t(\lambda) = \frac{\delta}{2}\lambda^2 + \lambda(q_t^\top \hat c_t + \alpha_t - c_0)$. With probability $1 - \epsilon/4$, we have



9.(A,)-p,(A)< — (|A-A,| J 

^(|A-A,P 
^(|A-A,P 
^(|A-A,P 



1 

2rj 



<^{\X-Xt\ 2 



A-A m | 2 ) + ||V Afft (A t )| 2 
A - A t+ i | 2 ) + r//2(x t r c t - c + a t + <5A t ) 2 
A-A t+1 | 2 ) +7 7 (x7c t ) 2 + ?? C 
A-A t+1 | 2 ) + V xJ (c t ) 2 + V C 



A-A t+1 | 2 ) + 



1-7 



1 1 c t + r/C 



A - A f+ i| ) + 



(1 1 c+ — a t ) + 77C 

1-7 7 



where $C = (1+\alpha_1)^2$ and $\alpha_t = \alpha_1/\sqrt{t}$. Summing the above inequalities over $t = 1, \ldots, T$, we have



^2 ^^t ~ M c o ~ " t - q t T c t ) + A(c 



t=i 



a t - q t c t ) - -A 



Zn L — ' 1 — 7 7 
Combining the primal inequality and the dual inequality, we have



IK 



T 



t=l 



^p T (r t + A t c t ) - q t T r t - A t (c - a t ) + ~A 2 + A(c - a t - q^ct) - 7A 2 



< 



A 2 T 



2'/ 



^1-7 



t=l 
T 



c + — a t ) + vCT + VKT + -^—VT • 

7' 7 J-~7 70 7 



4a 77 16afr]K 



t=i 



7(1 — 7) 7 2 5 2 



(l + lnT) + 



7<5 
In if 



'/ 



Then with probability 1 — e, we have the following inequality: 



5^(fJ + aa\ + A t 2* + A t a t ) - p T J2(r t + A t c t ) 



V {=1 
T 



7 



t=l 



p T (r t + A t c t ) - q t T ?t - A t (c - a t ) + -A 2 + A(c - a t - q^Ct) - ^A 2 

t=i 



< 



f + 53_2_(it c + £ Q() + ^ ct + 

2?7 ^— ' 1 — 7 7 1 



2r? —1 

4// 



t=l 



Ti | 4a 2 ?7 16a\rjK 
7(1 — 7) 7 2 5 2 



7 



(l+hiT) + 



In A" 
>7 



i=l i 



Let U T = max, E*=i(^ + ao\ + A t (cJ + ^a,)), V = ^ksTT^ ^ OW + ^ 
then we have 



1 - 



4-} 



13(1 -l) 



Ut -p T ^(r t + X t ct) 



t=i 



+ ^p T (r t + A t c t ) - q t T r t - A t (c - a t ) + ^ A(c - a t - qjc t ) 



t=i 

T 



<]T-^(l T c+-a t ) + ?? CT+- 
^-^ 1 — 7 7 1 



t=i 



'KT 



ST 1 , 



7 
7<5 



T 



7 J--7 
4a 2 ry \§a\r\K 



7(1 — 7) 7 2 5 2 



(1 + lnT) 



In if 
Since $U_T \ge \max_i \sum_{t=1}^{T}(r_t^i + \lambda_t c_t^i)$ and $p^\top \sum_{t=1}^{T}(r_t + \lambda_t c_t) \le \max_i \sum_{t=1}^{T}(r_t^i + \lambda_t c_t^i)$, we have with probability $1 - \epsilon$,



^2p T r t - qjr t - A t (c - a* - p T c t ) +^ A(c - a t - q^c*) - ( — + — J A 2 



<5T 1 



<^-^(l T c+- a( ) + ) ,CT +T 
r-f 1 - 7 7 1 



Waf-qK 



(1 + lnT) 



In if 47 



7<5 7(1-7) 

T \ 



< 



7 2 <S 2 v 7 ?? ^(1-7) 

aiVT 7T C7T 



1-7 0(1-7) /3 1-7 
4a 2 



max + A t c* 



a / 4ai A' , — 



1 s 



(3(1 - i)K /3 7 <5 2 
Then 



16a? | | pjKhaJ QS+l , 4 7 T «i + 1 



5 0(1-7) * 



P Tr * ~ Pt*t 



ELi((! - 7)(c - a t ) - p t T c t ) 



t=i 



2(ST+l/rj) 



r= Ci-yT r— Aa x k r~ 4a 2 16a?, 

P(KlnK)6+l 6 + 1 

7 5 1 p8 ' 

Let 7 = 0{T-^ A ), rj = 0(T-^ 4 ), then we obtain 

T T 

max p T V r t - V r\ < 0(T 3 ^/y/6) and 
pTc ^ C0 t=i t=i 



^(c -p t T c) 



t=l 



When $\delta = O(T^{-1/4})$, the regret bound is $O(T^{7/8})$ and the worst-case constraint bound is $O(T^{7/8})$.



